## [1] 4898 13
Our data set consists of 13 variables, with 4,898 observations.
The data set is related to white “Vinho Verde” Portuguese wine.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Description of attributes:
This report explores a data set containing quality and attributes for 4,898 white wines.
Let’s first transform the quality variable into a factor variable. Let’s also create a numeric variable as the quality variable in order to be able to make plots with numerical data.
Check if the quality is a factor variable.
## [1] TRUE
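A minimal sketch of this step, assuming the data frame is called pf (as in the outputs later in this report) and using a few made-up rows instead of the real file. Treating the factor as ordered is my assumption, not something the report states.

```r
# Toy stand-in for the wine data; the real file has 4,898 rows.
pf <- data.frame(quality = c(6L, 5L, 7L, 6L, 3L, 9L))

# Keep a numeric copy for plots and models that need numbers.
pf$quality.nr <- as.numeric(pf$quality)

# Convert the rating itself to a factor with levels 3..9
# (ordered = TRUE is an assumption made for this sketch).
pf$quality <- factor(pf$quality, levels = 3:9, ordered = TRUE)

# Check that the conversion worked.
is.factor(pf$quality)
```

The numeric copy is created before the conversion so that it holds the original ratings and not the internal factor codes.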
There are 4898 observations spread into seven quality categories, from 3 (low quality) to 9 (good quality).
Let’s investigate how many observations are in each quality category.
Counts in each category:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Let’s better visualize how many counts are in each category.
From the above bar chart we can see that most of our white wines have a quality of six, with almost half of the observations, 2198. Next come the wines rated five, with 1457 samples, and seven, with 880 samples. There are very few samples of very good white wines: only five were rated nine. At the other end, only 20 were poor, with a rating of three. This imbalance can bias our results.
To better get a feel of the quality variable, let’s take a look at the percentage of each category from total.
These are the actual numbers:
## 3 4 5 6 7 8
## 0.4083299 3.3278889 29.7468354 44.8754594 17.9665169 3.5728869
## 9
## 0.1020825
We can see that the data set is not well balanced in relation to how many observations are in each quality category. This can bias our results and analysis. An ideal data set would include roughly the same number of observations in each category. The middle category, 6, has almost half of the observations, 44.88%. The most extreme categories, 3 and 9, together account for only 0.51% of the total number of observations. Categories 4 and 8 each account for around 3.5% of the data set, while category 5 accounts for 29.75% and category 7 for 17.97%.
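The percentages can be recomputed from the category counts printed above. This is a small sketch that types the counts in by hand; the variable names are mine.

```r
# Counts per quality category, copied from the table in this report.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Share of each category as a percentage of all observations.
pct <- 100 * counts / sum(counts)
round(pct, 2)

# Combined share of the two extreme categories, 3 and 9.
round(pct[["3"]] + pct[["9"]], 2)
```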
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The median of our data set is quality 6, which makes sense since it accounts for almost half of the observations. The average quality is slightly lower, at 5.88. This is because there are more observations with a quality below 6 (33.48%) than with a quality above 6 (21.64%).
Let’s create only three levels of quality:
low for quality 3, 4 and 5
medium for quality 6
premium for quality 7, 8 and 9
These levels are stored in a new variable, quality.type, with the values low, medium and premium. Now we have this new variable with only three levels to categorize the white wine quality.
## [1] "low" "medium" "premium"
## low medium premium
## 33.48305 44.87546 21.64149
Although this is not optimal, we can now see that these three categories are more balanced, with:
33.48% low quality
44.88% medium quality
21.64% premium quality
Using these categories, we will try to understand important features for the wine quality classification.
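One way to build the three levels is with cut(). This is a sketch under the assumption that the grouping was done on the numeric rating; the report does not show the original call. It reuses the category counts printed earlier to check the three percentages.

```r
# Counts per quality category, copied from the table in this report.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality.nr <- as.numeric(names(counts))

# (2,5] -> low, (5,6] -> medium, (6,9] -> premium
quality.type <- cut(quality.nr,
                    breaks = c(2, 5, 6, 9),
                    labels = c("low", "medium", "premium"))

# Percentage of all wines that falls into each of the three levels.
type_pct <- 100 * tapply(counts, quality.type, sum) / sum(counts)
round(type_pct, 2)
```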
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
From the above histogram we can see that the distribution of fixed.acidity looks roughly normal. There are some slightly extreme points of fixed acidity on the right of the distribution. The peak of our distribution is around 6.8 g/dm^3. Fifty percent of the wines have a fixed acidity that ranges from 6.3 to 7.3 g/dm^3. Let’s facet the data to see how it looks based on quality.
The premium category doesn’t have wines with a fixed acidity above 9.2 g/dm^3, and its average of 6.73 g/dm^3 is lower than for the medium and low quality white wines. Other than that, the distributions look fairly normal, with some extreme values in the low and medium quality white wines.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.400 6.800 6.962 7.500 11.800
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.700 6.725 7.200 9.200
If we zoom in, we can see that the average and median fixed acidity is lowest for the premium quality white wines.
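Per-group summaries in the format shown above can be produced with by(). The sketch below uses a handful of invented rows, so the numbers are illustrative only. The column names follow the report.

```r
# Toy rows, NOT the real data: six wines spread over the three quality types.
pf <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2, 6.2, 6.6),
  quality.type  = factor(c("low", "low", "medium", "medium", "premium", "premium"),
                         levels = c("low", "medium", "premium"))
)

# One six-number summary per quality type.
acidity_by_type <- by(pf$fixed.acidity, pf$quality.type, summary)
acidity_by_type

# Just the medians, which is what the box plots compare.
median_by_type <- tapply(pf$fixed.acidity, pf$quality.type, median)
median_by_type
```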
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The volatile acidity distribution is pulled to the right by larger values, with a mean of 0.278 g/dm^3, higher than the median value of 0.26 g/dm^3. In this data set, white wines have a volatile acidity ranging from 0.08 to 1.10 g/dm^3, with the middle fifty percent between 0.21 and 0.32 g/dm^3.
Transforming our data with a log 10 scale, we can see that our data is much more normal.
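To illustrate why a log 10 scale helps with right-skewed data, here is a sketch on synthetic values (log-normal quantiles chosen by me, not the wine data). On the raw scale the mean sits above the median; after the transform the two coincide.

```r
# Synthetic right-skewed values, NOT the wine data.
# ppoints() gives evenly spaced probabilities, so the result is deterministic.
x <- qlnorm(ppoints(999), meanlog = log(0.26), sdlog = 0.35)

gap_raw <- mean(x) - median(x)                # positive: mean pulled to the right
gap_log <- mean(log10(x)) - median(log10(x))  # about zero: symmetric after log10

c(raw = gap_raw, log10 = gap_log)
```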
Let’s facet wrap our data to better see what is happening per quality and apply a log scale.
The biggest values of volatile acidity are in the low quality white wines, which also have the highest median, 0.29 g/dm^3. Although the values for the medium quality white wines are more spread out, ranging from 0.08 to 0.965 g/dm^3, compared with 0.08 to 0.76 g/dm^3 for premium wines, the two groups share the same median value of 0.25 g/dm^3.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.2400 0.2900 0.3103 0.3500 1.1000
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2653 0.3200 0.7600
We can see here that low quality white wines have a higher median volatile acidity, while medium and premium quality white wines share the same median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
We can see that there are some extreme points in our data set, which skews our data set to the right. In particular, we can see a maximum value of 1.66 g/dm^3 citric acid.
Based on the above histograms, we can see that the medium quality white wines have the most extreme data points, but their median value of 0.32 g/dm^3 is the same as for the low quality white wines and slightly higher than for premium white wines, which have a median value of 0.31 g/dm^3.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3343 0.4100 1.0000
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3261 0.3600 0.7400
We can see that low quality wines are more spread but there is no big difference in median and mean values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
We can see that the distribution of residual sugar is right skewed with a long tail. In fact 25 percent of the data points are ranging from 9.9 to some very sweet wines of 65.8 g/dm^3. A white wine has, on average, 6.4 g/dm^3 residual sugar.
The transformed data shows a bimodal distribution, with a peak at around 2 and the other one at around 10. Perhaps this is due to the fact that we have dry wines and more sweet wines. Perhaps it would be a good idea to categorize these values and take a look if this has an impact on the quality.
We can see that this bimodal distribution persists at the quality level. Low quality wines tend to be sweeter, with a median value of 6.63 g/dm^3. Although the medium quality group contains the sweetest white wine, with 65.8 g/dm^3 residual sugar, its median value is not higher than that of the low quality white wines. Premium quality white wines are more compact, with less residual sugar, ranging from 0.8 to 19.25 g/dm^3, with a median value of 3.88 g/dm^3.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.625 7.054 11.025 23.500
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.800 3.875 5.262 7.400 19.250
Here we can better see the extreme data points in the medium quality white wines. Also, we can see that the low quality wines have a bigger interquartile range. In contrast, the premium quality white wines have a smaller range of values and a slightly smaller median residual sugar value.
Let’s cut the residual.sugar variable into five categories.
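There are five white wine categories: extra_dry, dry, semi-dry, semi_sweet and sweet.
The bin edges [2.72, 3.09, 3.18, 3.28, 3.82] and the labels [‘high’, ‘mod_high’, ‘medium’, ‘low’] belong to the four acidity level groups built from pH later in this report, not to the sugar categories.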
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
##
## extra_dry dry semi-dry semi_sweet sweet
## 2410 1295 1175 15 3
We can see that most wines are extra dry and only a few are in the sweeter categories.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Half of the white wines have a chloride level between 0.036 and 0.05 g/dm^3. The average chloride level in white wine is 0.046 g/dm^3, slightly higher than the median value of 0.043 g/dm^3 because of wines with higher chloride levels.
If we look at the chlorides distribution for each quality category, we can see that the premium quality chlorides histogram has fewer extreme data points, with a mean chloride of 0.038 g/dm^3.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05144 0.05300 0.34600
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500
We can better see here how the median and average chloride values are lower for medium and premium white wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
We can see that the values of free sulfur dioxide for the low quality white wines are more spread out, ranging from 2 to 289 mg/dm^3, while the medium and premium quality white wines have lower maximum levels: 112 mg/dm^3 for medium quality and 108 mg/dm^3 for premium quality white wines.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 20.00 34.00 35.34 49.00 289.00
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 24.00 34.00 35.65 46.00 112.00
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.55 42.00 108.00
From these box plots we can see that although the low quality group includes more wines with high free sulfur dioxide, the three groups share almost the same median and almost the same average.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
There are some extreme white wines with more total sulfur dioxide, up to a maximum of 440 mg/dm^3.
We can see that the most extreme values for total sulfur dioxide are in the low quality white wines.
We can see that the medium and low quality wines have more spread out data. The median and the average values for total sulfur dioxide are lower for premium white wines and higher for medium and low quality wines. I will analyse this variable further in the next section.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 117.0 149.0 148.6 182.0 440.0
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.2 132.0 137.0 164.0 294.0
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.2 146.0 229.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The density distribution ranges from a minimum of 0.987 g/cm^3 to a maximum of 1.039 g/cm^3, with an average equal to the median value of around 0.994 g/cm^3. This suggests a roughly symmetric, close to normal distribution.
We can see here that most of our premium quality white wines have lower density, with less data above 0.994. We can also see two peaks for premium quality white wines, one near 0.992 and the other around 0.997 g/cm^3.
From the box plots we can see that the mean and the median density values decrease with each quality level. Density might be a good feature to predict good wine quality. I will focus on density in the next parts.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9932 0.9951 0.9952 0.9971 1.0024
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9905 0.9917 0.9924 0.9936 1.0006
Let’s cut the density variable to create a density.levels factor with two levels: low_density, high_density.
##
## low_density high_density
## 2442 2456
From the above bar charts it seems that premium quality white wines have more low density observations than low quality white wines.
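A two-level factor like density.levels can be sketched with cut(). Splitting at the median is my assumption; it is consistent with the nearly equal group sizes printed above, but the report does not show the break point. The six values below are the summary statistics of density, used only as toy input.

```r
# Toy input: the six summary statistics of density, NOT the full column.
density <- c(0.9871, 0.9917, 0.9937, 0.9940, 0.9961, 1.0390)

# Split at the median (an assumption); include.lowest keeps the minimum value.
density.levels <- cut(density,
                      breaks = c(min(density), median(density), max(density)),
                      labels = c("low_density", "high_density"),
                      include.lowest = TRUE)

table(density.levels)
```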
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH distribution looks quite normal, with the median of 3.18 almost equal to the average of 3.19.
There are some extreme data points in each category, with slightly bigger values for the premium category.
We can see that both the medians and the means increase slightly with each quality level.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.79 3.08 3.16 3.17 3.24 3.79
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.215 3.320 3.820
Let’s create a pH.levels factor variable with four levels, named by acidity (so high means low pH): high, mod_high, medium and low.
##
## high mod_high medium low
## 1314 1246 1156 1182
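A sketch of how such a factor can be cut, using the quartiles from the pH summary above as bin edges and the acidity labels quoted in this report. The pH values are a few sample values plus the minimum and maximum, not the full column, and the exact cut() arguments are my assumption.

```r
# A few pH values from the data preview, plus the overall minimum and maximum.
pH <- c(3.00, 3.30, 3.26, 3.19, 3.18, 3.22, 2.72, 3.82)

# Quartile bin edges and acidity labels (low pH means high acidity).
bin_edges <- c(2.72, 3.09, 3.18, 3.28, 3.82)
bin_names <- c("high", "mod_high", "medium", "low")

# include.lowest = TRUE keeps the minimum pH in the first bin.
pH.levels <- cut(pH, breaks = bin_edges, labels = bin_names,
                 include.lowest = TRUE)

data.frame(pH, pH.levels)
```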
Average quality Ratings by Acidity Levels
## pf$pH.levels: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.799 6.000 8.000
## --------------------------------------------------------
## pf$pH.levels: mod_high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.784 6.000 8.000
## --------------------------------------------------------
## pf$pH.levels: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.889 6.000 9.000
## --------------------------------------------------------
## pf$pH.levels: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.053 7.000 9.000
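We can see that the low acidity group, which contains the wines with the highest pH, has the highest average quality at 6.05, compared with 5.78 to 5.89 for the other pH levels. This agrees with premium white wines having slightly higher pH than medium and low quality white wines.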
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The sulphates histogram of white wines is right skewed, with the median, 0.47 g/dm^3, being smaller than the average of 0.49 g/dm^3. The values range from a minimum of 0.22 to a maximum of 1.08 g/dm^3.
The transformed data looks more normally distributed, with the bulk of our data between 0.41 and 0.55 g/dm^3.
We can see that the premium quality white wine sulphates values are more spread out, but their median value is equal to that of the medium quality wines, at 0.48 g/dm^3, slightly higher than for low quality white wines, which have a median value of 0.47 g/dm^3.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.4100 0.4700 0.4815 0.5300 0.8800
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4800 0.4911 0.5500 1.0600
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4000 0.4800 0.5001 0.5800 1.0800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
We can see that the alcohol values range from a minimum of 8% to a maximum of 14.2%. On average, a white wine has 10.5% alcohol.
Here we can see that three quarters of the low quality white wines have an alcohol level below 10.4%, while for the medium quality white wines the same threshold is 11.4% and for the premium wines 12.4%.
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
We can see that as white wine quality increases, the median and mean alcohol content increase. This might indicate a relation between alcohol and quality. Premium quality white wines have, on average, more alcohol than those in the low and medium quality types. Alcohol might be a good predictor of wine quality and deserves a closer look.
Let’s group the wines by alcohol content to see which group receives better ratings.
To answer this question, I will create three groups of wine samples:
low < 11%
moderate < 13%
high > 13%
The counts below split these further into six levels, from low to highest.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
##
## low low_mod moderate mod_high high highest
## 502 1583 1252 850 609 102
Here we can see that premium quality white wines have more high alcohol content.
There are 4,898 white wines in our data set with 11 input variables and one output variable (based on sensory data).
The quality variable was transformed into a factor variable. I also created a new variable in the data set, quality.nr, which is a numeric variable.
quality.type level.
The pH distribution looks normal with a median value of 3.18. Premium quality wines have, on average, a higher pH.
On average, a white wine has 10.5% alcohol.
The main features of interest are alcohol and density. I suspect that alcohol and/or density, together with some combination of the other variables, can be used to build a predictive model for white wine quality.
Fixed acidity, volatile acidity, residual sugar, free sulfur dioxide and pH may help in determining quality.
From the quality variable I created two other variables. First, a numeric variable, quality.nr. Second, a factor variable called quality.type with three levels: low, medium and premium. In the low category I included the 3, 4 and 5 quality categories; in the medium, the 6 quality category; and in the premium, the 7, 8 and 9 quality levels.
A sugar level factor with five levels: extra_dry, dry, semi-dry, semi_sweet and sweet. This variable is cut from residual.sugar because, after applying a log scale, we saw a bimodal distribution, and there may be patterns related to quality, based on sugar levels, that can be analysed further.
density.levels, with two levels: low_density and high_density. This variable is cut from the density variable to get a better sense of how low or high density relates to white wine quality.
pH.levels, with four levels: high, moderately high, medium and low.
An alcohol level factor with six levels: low, low_mod, moderate, mod_high, high and highest.
For example, I saw that after transforming residual sugar with a log scale the distribution becomes bimodal. This might indicate that there are categories from dryer to sweeter white wine, with less residual sugar for premium quality white wines.
From this matrix we can better see the relationships between each pair of variables in our data set. Let’s also take a look at the Correlation Coefficient of quality versus the rest of the variables.
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
From the above matrix we can see that quality has the highest Correlation Coefficient with:
alcohol, 0.44, which might indicate a moderate positive linear relationship
density, -0.31, which might indicate a moderate negative linear relationship
chlorides, volatile.acidity and total.sulfur.dioxide, which all might indicate weak negative linear relationships
We also have to pay attention to the fact that alcohol and density are correlated with each other. They have a Correlation Coefficient of -0.78, which might indicate a strong negative linear relationship between them. When building a linear regression model, we have to be careful about multicollinearity, which arises when independent variables are strongly correlated with each other and makes the individual coefficient estimates unstable. Density and residual sugar have a Correlation Coefficient of 0.84, which might indicate a strong positive linear relationship between them.
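A one-column correlation table like the one above can be obtained by passing the input columns and the numeric quality to cor(). The sketch below uses six invented rows, so the coefficients will not match the report; only the shape of the call is the point, and whether the original code looked like this is an assumption.

```r
# Toy rows, NOT the real data.
pf <- data.frame(
  alcohol    = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4),
  density    = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9920, 0.9900),
  quality.nr = c(5, 5, 6, 6, 7, 8)
)

# Pearson correlation of each input column with quality (a 2 x 1 matrix).
cors <- cor(pf[, c("alcohol", "density")], pf$quality.nr)
cors

# Correlation between the two predictors, relevant for multicollinearity.
cor(pf$alcohol, pf$density)
```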
Let’s closely take a look at these variables.
Looking at this scatter plot we can see the moderate linear relationship between the two variables. The median and average alcohol are increasing as the quality increases. Better white wines tend to have more alcohol content.
Here, we can see that medium white wines have the highest variance of alcohol, while low quality white wines typically fall below 12% alcohol. Although there are premium white wines with alcohol less than 10%, the majority are above this threshold.
Here we can see some extreme values for density in the medium quality white wines. Let’s focus on the bulk of our data and see what is happening.
We can see that for high quality white wines the density has, on average, lower values.
Low values of density are more common for high quality white wines.
The highest variance of chlorides with bigger values are in the low and medium quality white wines. Based on the above graphs, on average, chloride is higher in low quality white wines. Or, good quality wines are less salty.
Let’s take a look also at two correlated features: Density and Alcohol.
Here we can see the strong negative relationship between density and alcohol. We can see that as alcohol increases, density tends to decrease.
Low density white wines have higher variance and greater alcohol content.
Here we can better break our density-alcohol relationship. We can see how low alcohol levels in white wine relates to higher density values.
Residual Sugar
## # A tibble: 3 x 14
## quality.type mean_alcohol median_alcohol min_alcohol max_alcohol
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 low 9.85 9.6 8 13.6
## 2 medium 10.6 10.5 8.5 14
## 3 premium 11.4 11.5 8.5 14.2
## # ... with 9 more variables: mean_density <dbl>, median_density <dbl>,
## # min_density <dbl>, max_density <dbl>, mean_pH <dbl>, median_pH <dbl>,
## # min_pH <dbl>, max_pH <dbl>, n <int>
We can see that the sulphates variance is bigger for medium and premium quality white wines.
Sugar and density have a strong positive relationship, with a correlation coefficient of 0.84.
Residual sugar has some bigger values. Let’s focus on the bulk of our data to better understand this relationship.
In the above scatter plot we can see the strong linear relationship between density and residual sugar. As density increases, residual sugar tends to increase in white wines. It is also visible that many of our white wine samples have little sugar.
High density white wines have a higher variance of residual sugar, while low density white wines are more compact, typically, with less than 10 g/dm^3.
The density of white wine varies in relation to its sugar content, from dryer white wines, with lower density, to sweeter white wines, with higher density. It will be interesting to see how these trends relate to quality. We will look more closely in the next sections of the analysis. We can also see that there aren’t many observations in our semi-sweet and sweet white wine categories.
First, alcohol has a correlation coefficient of 0.44 with quality, which indicates a moderate linear relationship. Taken alone, that corresponds to roughly 19% of the variance in quality (the square of 0.44). The median and average alcohol content increase slightly as the quality increases. Better wines tend to have more alcohol content. If we cut alcohol into a categorical variable from low to highest alcohol levels, we can clearly see how low quality white wines have fewer observations for higher alcohol content.
On the other hand, density has a correlation coefficient of -0.31 with quality, or roughly 9% of the variance, which indicates a weak to moderate negative relationship. Low values of density are more common for high quality white wines. If we cut density into two categories, low density and high density, we can see that higher values of density are associated more with lower quality in white wine, typically below 6. Density also has a strong positive linear correlation with residual sugar, with a correlation coefficient of 0.84.
Another interesting observation about alcohol and density is that they also correlate with each other, which may cause multicollinearity issues if we use both variables in a linear model. There is a strong negative linear relationship, with a correlation coefficient of -0.78. Low alcohol levels in white wine relate to high density values.
Sugar and density correlate with each other. The density of white wine varies in relation to its sugar content, from dryer white wines, with lower density, to sweeter white wines, with higher density. High density white wines have a higher variance of residual sugar, while low density white wines are more compact, typically with less than 10 g/dm^3.
White wine quality is positively and moderately correlated with alcohol content. The strongest relationship I found is between density and sugar with a Correlation Coefficient of 0.84 which indicates a strong positive linear relationship.
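A correlation coefficient and the share of variance explained are different quantities: for a single predictor in a linear fit, the share of variance explained is the square of the correlation coefficient. A short worked check, using the coefficients printed in the correlation table of this report:

```r
# Correlation of quality with alcohol and density, copied from the report.
r <- c(alcohol = 0.435574715, density = -0.307123313)

# Share of variance explained by each predictor on its own.
r_squared <- r^2

# As percentages, rounded to one decimal place.
round(100 * r_squared, 1)
```

Squaring removes the sign, so the direction of the relationship has to be read from the correlation coefficient itself.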
These histograms show the distribution of alcohol in white wines by alcohol level, for each quality category. While the distribution for medium white wines looks more normal, low quality white wines are more skewed to the right, with more wines at lower alcohol levels. The bulk of the data for premium quality white wines sits at higher alcohol content.
The previous histograms of alcohol by alcohol level show clearly how each quality type is structured in relation to alcohol. This scatter plot further supports our intuition that alcohol levels play a role in determining quality.
If we plot the median line for quality versus alcohol by each alcohol level, we can see that the variance is quite large. Based on the median line, white wines with lower alcohol content don’t have a rating higher than 6, while those in the highest alcohol level don’t have a quality below 6.
While quality 6 has slightly more low density white wines, qualities 5, 4 and 3 have more high density white wines and qualities 7 and 8 have more low density white wines.
Here we can better see that chlorides is lower for the premium white wines.
The same pattern, high density white wines with lower alcohol content.
These density plots show the distribution of white wine quality for each alcohol level by high and low density. We can see that white wines with moderate alcohol show roughly the same amount of low and high density in each quality type, with slightly less density in higher quality white wines. What is important to note is that there are both low and high density white wines in the poorer quality types, more specifically in 5 and 4, but this is more common in wines with less alcohol content. This may be due to the fact that quality and alcohol are positively correlated while alcohol and density are negatively correlated.
It is easier to see here how premium white wines have more low density observations as opposed to high density.
Here we can see that although it is likely to find low alcohol white wines with low density, it is less likely to find higher alcohol white wines with high density.
High density white wines tend to be more sweet. We can see that most low density white wines are dryer while high density white wines are sweeter.
Here we can see the same pattern but for each quality. Higher density with higher sugar levels, more often found for medium and low quality white wines.
We can see that residual sugar does account for some variance in quality, with premium quality white wines tending to have less sugar and higher alcohol levels.
Of the features that I have looked into, alcohol seems to be the most important in determining white wine quality, in combination with various other features. White wines with lower alcohol levels, based on the median value, don’t have a quality rating higher than 6, while those with the highest alcohol levels don’t have a quality below 6. The more moderate alcohol levels range from quality 5 to 7, so towards the better quality ratings. Medium quality white wines don’t vary much in their density level. The difference comes in the more extreme quality types, with lower quality white wines having higher density and higher quality wines having lower density. It is interesting to note that white wines with moderate alcohol content have similar density levels in each quality type. Higher alcohol levels have more low density observations. Although it is likely to find low alcohol white wines with low density, it is less likely to find higher alcohol white wines with high density.
High density white wines tend to have lower alcohol content and they tend to be sweeter. Also, premium white wines tend to have less sugar.
Let’s see how much of White Wine quality is dependent on the input variables by fitting a regular linear model.
##
## Call:
## lm(formula = I(quality.nr) ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
The regression summary shows that the input variables together explain about 28% of the variance in white wine quality (adjusted R-squared of 0.2803). Based on the p-values, the features that contribute significantly to predicting white wine quality are fixed acidity, volatile acidity, residual sugar, free sulfur dioxide, density, pH, sulphates and alcohol. Citric acid, chlorides and total sulfur dioxide are not significant at the 0.05 level.
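As a rough illustration of how the adjusted R-squared and the significant predictors can be read off a fitted model programmatically, here is a minimal sketch. It runs on a small synthetic data frame, not on the wine data: the name `toy`, its three predictors and the simulated coefficients are assumptions made only so the example is self-contained.

```r
# Synthetic stand-in for the wine data; `toy` and its values are hypothetical.
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0),
  citric.acid      = runif(n, 0, 1)      # unrelated to the response by construction
)
toy$quality.nr <- 3 + 0.3 * toy$alcohol - 1.5 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

fit <- lm(quality.nr ~ alcohol + volatile.acidity + citric.acid, data = toy)
s <- summary(fit)

# Share of variance explained, penalised for the number of predictors.
adj_r2 <- s$adj.r.squared

# Predictors whose p-value falls below the chosen threshold.
p <- s$coefficients[-1, "Pr(>|t|)"]      # drop the intercept row
significant <- names(p)[p < 0.01]

print(round(adj_r2, 3))
print(significant)
```

On the real data, the same two extractions applied to the model above would return the adjusted R-squared of 0.2803 and the predictors marked with at least two stars in the summary.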
The first bar chart shows how the white wine observations are categorized into 7 ranks, from 3 to 9, lowest to highest quality. Because the more extreme categories are not well represented in the data set, I created a new ordinal variable with just three ranks: low, medium and premium.
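A three-level ordinal variable of this kind can be built with `cut()`. The sketch below is only an illustration: the cut points (3 to 5 as low, 6 as medium, 7 to 9 as premium) are an assumption, since the exact thresholds are not stated here, and the short `quality` vector is made up.

```r
# Quality scores on the 3-9 scale; this short vector is illustrative only.
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Assumed cut points: 3-5 -> low, 6 -> medium, 7-9 -> premium.
quality.type <- cut(
  quality,
  breaks = c(2, 5, 6, 9),                # intervals (2,5], (5,6], (6,9]
  labels = c("low", "medium", "premium"),
  ordered_result = TRUE                  # keep the low < medium < premium ordering
)

print(table(quality.type))
```

Using an ordered factor rather than a plain character column keeps the ranks in their natural order in tables and plots.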
## pf$quality.type: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## --------------------------------------------------------
## pf$quality.type: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## pf$quality.type: premium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
On average, alcohol content in white wines increases by about 1.6 percentage points as quality increases from low to premium, from 9.85% alcohol in low quality white wines to 11.42% alcohol in premium quality white wines. The same holds for the median values, which rise from 9.6% alcohol in low quality white wines to 11.5% in premium white wines.
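The size of the increase can be checked directly from the group means and medians in the summary output above. The vector names below are introduced only for this check.

```r
# Group statistics copied from the per-quality-type summary output.
alcohol_mean   <- c(low = 9.85, medium = 10.58, premium = 11.42)
alcohol_median <- c(low = 9.60, medium = 10.50, premium = 11.50)

# Change between adjacent quality types (low -> medium, medium -> premium).
step_mean <- diff(alcohol_mean)

# Overall change from low to premium, in percentage points of alcohol.
total_mean   <- alcohol_mean[["premium"]] - alcohol_mean[["low"]]
total_median <- alcohol_median[["premium"]] - alcohol_median[["low"]]

print(round(step_mean, 2))
print(round(c(mean = total_mean, median = total_median), 2))
```

The mean rises by 0.73 and then 0.84 percentage points, 1.57 in total, and the median rises by 1.9 percentage points.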
Alcohol and density have the strongest correlations with the numeric value of quality, with a correlation coefficient (r) of 0.44 for the first and -0.31 for the second. Density and alcohol also have a strong negative linear correlation, with r = -0.78. In relation to quality, there are more premium white wines with higher alcohol content and lower density, and more low quality white wines with higher density.
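The values quoted above are on the correlation scale, which runs from -1 to 1 and can be negative. Squaring a correlation coefficient gives the R-squared of the corresponding one-predictor linear fit. The sketch below shows this for the reported values and then confirms the identity on a small made-up sample. The `x` and `y` vectors are hypothetical and are not taken from the wine data.

```r
# Correlation coefficients as reported in the text.
r <- c(quality_alcohol = 0.44, quality_density = -0.31, density_alcohol = -0.78)

# Squaring gives the share of variance explained by a simple linear fit.
r_squared <- r^2
print(round(r_squared, 2))

# The same identity on toy data (hypothetical numbers).
x <- c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4)
y <- c(1.001, 0.996, 0.995, 0.996, 0.992, 0.990)
r_xy  <- cor(x, y)                       # Pearson correlation by default
r2_lm <- summary(lm(y ~ x))$r.squared    # R-squared of the one-predictor fit

print(isTRUE(all.equal(r_xy^2, r2_lm)))
```

For the reported coefficients this gives roughly 0.19, 0.10 and 0.61, so alcohol alone accounts for about 19% of the variance in the numeric quality score.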
First, I saw that the data set is not well balanced, which can bias the results: the observations are not uniformly distributed across the quality categories. Therefore, I created a more inclusive categorical quality variable with only three levels: low, medium and premium.
Then, I saw that there are some main features of interest in our data set, such as alcohol and density. The median and mean alcohol content increase as quality increases: premium quality white wines tend to have more alcohol. Low density values are also more common among high quality white wines.
What is interesting is that density and alcohol also share a strong negative linear relationship: as alcohol increases, density tends to decrease. Density also has a strong positive linear relationship with residual sugar, with a correlation coefficient (r) of 0.84: as density increases, residual sugar tends to increase, from drier white wines with lower density to sweeter white wines with higher density.
One limitation of this analysis is how the data set is structured: we may need more observations for the most extreme white wine quality categories. A more inclusive data set may allow for a better comparison. There are also some outliers that can bias the results of the analysis. Finally, of the 11 chemical properties of white wine, only the important features should be taken into consideration for white wine quality prediction.